rewrite: Pythonic, ocrd v3, utilise page-level annotation #28

bertsky · 2025-02-17T17:31:33Z

First attempt at a full OCR-D processor for this. It builds on core 3.0, which brings error handling and page parallelism. (That requires Python instead of bashlib, but the Python version is already much faster even when run sequentially.)

To deal with the OCR-D annotation (cropping and deskewing, possibly also binarization and denoising, or even clipped text-image separation), it does not suffice to pass the original image and the PAGE file name to the converter. Instead, one needs to extract or generate the derived image for the page and then transform all coordinates in the PAGE file accordingly. This is borrowed from ocrd-segment-replace-original.
Ships PRImA PDF converter as package data
Can cope with METS Server – thus, get_metadata (with its XML tree operations) needs to read from a filesystem copy of the METS instead of the ClientSideOcrdMets
I kept the multipagepdf code, but separated the functions
negative2zero was too simplistic. In OCR-D, we have the PageValidator, which checks for all kinds of coordinate invalidities and inconsistencies. I borrowed from ocrd-segment-repair for the actual repairs (although this is debatable: perhaps we should keep repair as a separate step; also, I had to copy and paste a lot of polygon handling code).
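
The coordinate mapping mentioned above (from the original image frame into a cropped and deskewed derived image) can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: the helper names `crop_then_rotate` and `transform_polygon`, the crop offset, the angle and its sign convention are all hypothetical, and the sketch ignores the canvas growth that rotation usually causes.

```python
import math

def crop_then_rotate(offset_x, offset_y, angle_deg, center):
    """Build a 3x3 affine matrix (hypothetical helper).

    First shifts by the crop offset, then rotates by angle_deg
    around `center` (given in the cropped image's frame).
    """
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    cx, cy = center
    # p' = R * (p - offset - center) + center, expanded into one translation
    tx = cx - cos_a * (offset_x + cx) + sin_a * (offset_y + cy)
    ty = cy - sin_a * (offset_x + cx) - cos_a * (offset_y + cy)
    return [[cos_a, -sin_a, tx],
            [sin_a, cos_a, ty],
            [0.0, 0.0, 1.0]]

def transform_polygon(points, matrix):
    """Apply the affine matrix to every (x, y) point, rounding to pixels."""
    result = []
    for x, y in points:
        new_x = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
        new_y = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
        result.append((round(new_x), round(new_y)))
    return result

# Crop at (100, 50) without any rotation: points are simply shifted.
matrix = crop_then_rotate(100, 50, 0.0, center=(200, 300))
print(transform_polygon([(100, 50), (300, 50), (300, 250)], matrix))
# prints [(0, 0), (200, 0), (200, 200)]
```

Every polygon in the PAGE file would have to go through the same matrix, so that the outlines stay aligned with the derived image that is handed to the converter.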

However, there still seems to be a problem with the coordinates of the outlines...

Next I'll add further improvements:

filtering / selection of image features (e.g. binarized or not)
font param as installable resmgr resources
further metadata from MODS
logical structMap to bookmarks (outline labels)
option to write page-wise PDFs into temporary storage only
basic tests and CI

This depends on OCR-D/core#1305.

JKamlah · 2025-02-18T07:06:39Z

Thank you very much, @bertsky, for all your efforts to keep the module up to date. If you are satisfied with your updates and the module is compatible with the latest OCR-D standards, I will gladly accept the PR.

…add pytest option --workspace for subsets, determine input fileGrp automatically, download and process up to 4 random pages only, test PAGE2PDF and ALTO2PDF, depending on whether PAGE or ALTO is in the workspace

bertsky · 2025-02-22T03:08:56Z

next up:

update Readme
add CI (basically make test PYTEST_ARGS="--workspace all -vv")

Notice that this also supports things like ocrd-altotopdf -I FULLTEXT,ORIGINAL -O DOWNLOAD -P multipage FULLDOWNLOAD now. And both processors add a table of contents now.

I wonder if the multipage file ID should really be specified manually. It may be difficult to come up with a non-conflicting name in a scripted setting. In contrast, the tool itself could try with mets.unique_identifier or identifiers from MODS, and convert these to a safe XML ID. (So the multipage parameter would just become a boolean.) What do you think?
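
A minimal sketch of the "convert to a safe XML ID" step, assuming the identifier has already been read from the METS or MODS. The function name `to_safe_xml_id` and the `id_` prefix are hypothetical choices for illustration, and the sketch restricts itself to the ASCII subset of what an XML ID allows.

```python
import re

def to_safe_xml_id(identifier, prefix="id_"):
    """Turn an arbitrary identifier into an ASCII string usable as an XML ID.

    (hypothetical helper, not part of this PR)
    """
    # Replace everything except ASCII letters, digits, '_', '.' and '-'.
    safe = re.sub(r"[^A-Za-z0-9_.-]", "_", identifier.strip())
    # An XML ID must start with a letter or an underscore.
    if not safe or not re.match(r"[A-Za-z_]", safe):
        safe = prefix + safe
    return safe

print(to_safe_xml_id("urn:nbn:de:bsz:180-digad-22977"))
# prints urn_nbn_de_bsz_180-digad-22977
print(to_safe_xml_id("1234/abc"))
# prints id_1234_abc
```

Note that such a mapping is not injective (different identifiers can collapse to the same ID), so a check against the file IDs already present in the METS would still be needed.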

JKamlah · 2025-02-25T07:53:48Z

Thanks @bertsky for all your work. We still have to decide how we want to proceed with the repository in general. I hope that we will have made a decision by the end of the week.

bertsky · 2025-03-04T15:30:44Z

next up:

done. I have also added continuous deployment. You would need to add the following in the repo settings to make everything work:

add an environment secret DOCKERHUB_USERNAME
add an environment secret DOCKERHUB_PASSWORD
log in to PyPI.org, create a security token, then copy and paste it as a new environment secret PYPI_TOKEN

We still have to decide how we want to proceed with the repository in general.

What do you mean?

stweil · 2025-03-04T17:15:47Z

We still have to decide how we want to proceed with the repository in general.

What do you mean?

We are talking with the OCR-D coordination team about moving this repository to https://github.com/OCR-D/.

bertsky · 2025-03-04T17:29:29Z

We are talking with the OCR-D coordination team about moving this repository to https://github.com/OCR-D/.

I see. If and when that's certain, please let me know so I can adapt the upstream URLs (packaging, CI+CD) before this is merged.

JKamlah · 2025-03-05T16:56:40Z

Edited: We have decided to move the repository to https://github.com/OCR-D/. @kba should now be able to carry out the transfer at the right time.

bertsky · 2025-03-05T17:05:48Z

We have now decided to move the repository to https://github.com/OCR-D/ in the next few days. Do you need more time to adapt the code?

done: 7021614

bertsky added 2 commits February 17, 2025 14:16

Pythonic (and ocrd>=3.0) rewrite

ce689f7

re-add ocrd-tool.json as symlink

c2f874d

bertsky added 23 commits February 18, 2025 15:03

add params image_feature_filter/selector

1971737

fix outline coordinates (by updating ' Page/@image_*')

6a82bfb

multipage: raise instead of log when gs fails

f161ef5

multipage metadata: utilise more DOCINFO (Document Information Dictionary) fields

ae8d86e

check if any text exists on textequiv_level, warn if not

14b2f9f

add parameter 'multipage_only', removing single-page files finally

6076a96

title metadata: avoid relatedItem

9d3e30b

producer metadata: use pkg name and version

69786ce

pagelabel parameter: add pagelabel value (using @ORDER/LABEL)

7d8e141

add ALTO2PDF processor (converting ALTO→PAGE first, using 2. input fileGrp for images, no parsing/validation/repair)

dda1768

multipage: escape/encode strings properly

a779985

multipage: add MODS as extra XMP file, add logical structMap as bookmark labels

ef46e63

finalize processor docstrings

7c89da8

update makefile/dockerfile

4cb800c

multipage: do not string-format MODS XMP stream, but do avoid non-ASCII characters

1918a0e

multipage: fix MODS author retrieval

4f6e2e3

multipage: make logical structMap parser more robust

a1cba43

add some fonts as downloadable resources

0591030

refactor to avoid get_physical_pages on ClientSideOcrdMets

d04fb69

add basic tests

dfaa249

altotopdf: improve logging

181f9b3

tests: work around core#1149 by downloading remotely

e649b8f

bertsky marked this pull request as ready for review February 22, 2025 03:08

update readme, add CI+CD

df5bfa4

bertsky added 2 commits March 4, 2025 18:00

setuptools: adapt pkg discovery to repo subdirectory

057be92

fix+improve dockerfile

7f0d04c

prepare for Github transfer UB-Mannheim→OCR-D

7021614
